windows_path __repr__ does not work if it contains surrogate characters#181
Conversation
|
@yunzheng and @JSCU-CNI are there ways to make this more robust? See fox-it/dissect.target#1276 for more information about the issue. |
yunzheng
left a comment
There was a problem hiding this comment.
Can you also add an unit test?
I think this should be good enough, as this is only when printing to the terminal for viewing consumption.
What it should print is a different matter though. If one would expect that this "\udce4" should be printed as "ä" (i'm not sure if that is the case here as well in your issue). Then this needs to be fixed with the correct encoding when storing it as a utf-8 string. I'm not sure if shellbag is always a fixed windows encoding or dependent on the language setting of the OS.
|
@respondersGY thank you for your contribution! As this is your first code contribution, please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information:
Contributor License Agreement
Contribution License AgreementThis Contribution License Agreement ("Agreement") governs your Contribution(s) (as defined below) and conveys certain license rights to Fox-IT B.V. ("Fox-IT") for your Contribution(s) to Fox-IT"s open source Dissect project. This Agreement covers any and all Contributions that you ("You" or "Your"), now or in the future, Submit (as defined below) to this project. This Agreement is between Fox-IT B.V. and You and takes effect when you click an “I Accept” button, check box presented with these terms, otherwise accept these terms or, if earlier, when You Submit a Contribution.
|
|
@DissectBot agree [company="Responders B.V."] |
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 38: invalid continuation byte`
|
I added a test that triggers |
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 38: invalid continuation byte`
be7cef5 to
b410276
Compare
|
This more looks like an issue with windows_path's While the normal path does |
|
So it indeed looks more to be an issue with class windows_path(pathlib.PureWindowsPath, path):
def __repr__(self) -> str:
s = str(self)
# Only use repr() if we have surrogates that need escaping
try:
s.encode('utf-8')
except UnicodeEncodeError:
# Has surrogates - use repr but fix the over-escaping
s = repr(s)[1:-1] # This escapes surrogates as \udcXX
s = s.replace('\\\\', '\\') # Fix double backslashes
s = s.replace("\\'", "'") # Fix over-escaped quotes
s = s.replace('\\"', '"') # Fix over-escaped double quotes
quote = "'"
if "'" in s:
if '"' in s:
s = s.replace("'", "\\'")
else:
quote = '"'
return f"{quote}{s}{quote}"Then it seems to work (without the change in the RecordPrinter) |
|
Added in I added |
yunzheng
left a comment
There was a problem hiding this comment.
added small remark, looks good otherwise
errors=surrogateescape
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #181 +/- ##
==========================================
+ Coverage 83.03% 83.07% +0.03%
==========================================
Files 34 34
Lines 3602 3609 +7
==========================================
+ Hits 2991 2998 +7
Misses 611 611
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thank you for the help and the review! |
Closes fox-it/dissect.target#1276